Abstract. The VC dimension of the Ising perceptron with binary patterns is calculated by numerical enumerations for system sizes $N \le 31$. It is significantly larger than $N/2$. The data suggest that there is probably no well defined asymptotic behaviour for $N \to \infty$.

The Vapnik-Chervonenkis (VC)-dimension is one of the central quantities used in both mathematical statistics and computer science to characterize the performance of classifier systems [1, 2]. In the case of feed-forward neural networks it establishes connections between the storage and generalization abilities of these systems [3, 4, 5]. In this letter we will discuss the VC dimension of the Ising-perceptron with binary patterns.

The VC-dimension $d_{VC}$ is defined via the growth function $\Delta(p)$. Consider a set $X$ of instances $x$ and a system $C$ of binary classifiers $c: X \to \{-1, 1\}$ that group all $x \in X$ into two classes labeled by $1$ and $-1$ respectively. For any set $\{x^\mu\}$ of $p$ different instances $x^1, \ldots, x^p$ we determine the number $\Delta(x^1, \ldots, x^p)$ of different classifications $c(x^1), \ldots, c(x^p)$ that can be induced by running through all classifiers $c \in C$. A set of instances is called shattered by the system $C$ if $\Delta(x^1, \ldots, x^p)$ equals $2^p$, the maximum possible number of different binary classifications of $p$ instances. Large values of $\Delta(x^1, \ldots, x^p)$ roughly correspond to a large diversity of mappings contained in $C$. The growth function $\Delta(p)$ is now defined by



$$\Delta(p) = \max_{\{x^\mu\}} \Delta(x^1, \ldots, x^p). \qquad (1)$$



It is obvious that $\Delta(p)$ cannot decrease with $p$. Moreover for small $p$ one expects that there is at least one shattered set of size $p$ and hence $\Delta(p) = 2^p$. On the other hand this exponential increase of the growth function is unlikely to continue for all $p$. The value of $p$ where it starts to slow down gives a hint on the complexity of the system $C$. In fact the Sauer lemma [1, 6] states that for all systems $C$ of binary classifiers there exists a natural number $d_{VC}$ (which may be infinite) such that



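$$\Delta(p) \begin{cases} = 2^p & \text{if } p \le d_{VC} \\ \le \sum_{i=0}^{d_{VC}} \binom{p}{i} & \text{if } p \ge d_{VC} \end{cases} \qquad (2)$$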



$d_{VC}$ is called the VC-dimension of the system $C$. Note that it will in general depend on the set $X$ of instances to be classified.
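The two regimes of the Sauer lemma can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `sauer_bound` and the value `d_vc = 3` are assumptions chosen for the example and do not come from the text.

```python
from math import comb


def sauer_bound(p: int, d_vc: int) -> int:
    """Sauer bound on the growth function: sum_{i=0}^{d_vc} binom(p, i).

    For p <= d_vc the sum equals 2**p, so both branches of the lemma
    are covered by the same expression.
    """
    return sum(comb(p, i) for i in range(d_vc + 1))


if __name__ == "__main__":
    d_vc = 3  # illustrative value, not a result from the paper
    for p in range(1, 9):
        # Columns: p, unrestricted count 2^p, Sauer bound
        print(p, 2 ** p, sauer_bound(p, d_vc))
```

For $p \le d_{VC}$ the two printed columns coincide; beyond that the bound grows only polynomially in $p$, which is the slowdown of the growth function described above.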

A concrete example for a system of classifiers is 
given by the well known perceptrons defined by 



$$\sigma = \operatorname{sign}\Bigl(\sum_{i=1}^{N} J_i \xi_i\Bigr) \qquad (3)$$



where the weights $\mathbf{J} \in \mathbb{R}^N$ parameterize the perceptron and $\boldsymbol{\xi} \in \mathbb{R}^N$ is an instance or pattern to be classified. The multiplication of $\mathbf{J}$ by a constant factor does not affect the output $\sigma$, so the weights are usually restricted by $\mathbf{J}^2 = N$. For this spherical perceptron the exact result $d_{VC} = N$ has been obtained analytically [7].

The Ising-perceptron is a spherical perceptron with the additional constraint $J_i = \pm 1$ on the weights. For real valued patterns $\boldsymbol{\xi} \in \mathbb{R}^N$ this constraint does not affect the VC-dimension, i.e. $d_{VC} = N$ still holds [8].

Since much of the interest in neural networks with discrete weights is due to their easy technical implementation it is important to consider not only binary weights but also binary patterns $\xi_i = \pm 1$. To avoid problems with the sign-function if $\mathbf{J} \cdot \boldsymbol{\xi}$ happens to be 0, one introduces a threshold $\theta = \pm 1$ for $N$ even: $\sigma = \operatorname{sign}(\mathbf{J} \cdot \boldsymbol{\xi} + \theta)$. Since the VC-dimension for the Ising-perceptron with $N = 2n$ and $\theta = \pm 1$ is the same as for $N = 2n + 1$ without threshold, we will consider only odd values of $N$ throughout this paper.

The determination of the VC-dimension of the Ising-perceptron with binary patterns is a hard problem. Analytical calculations based on the replica method [9] are not very helpful since this method is suited to calculate typical or average quantities whereas the VC-dimension is an extremal concept due to the max in eq. (1). For the spherical perceptron this difference does not really matter, but for networks with discrete weights it is crucial.

To get at least a lower bound for $d_{VC}$ it suffices to find a large shattered set by a smart guess. Consider the set ($N$ odd):
(4)



Let $\boldsymbol{\sigma} = (\sigma_0, \ldots, \sigma_{(N+1)/2})$ be an arbitrary output vector. To see how $\boldsymbol{\sigma}$ can be realized by the binary perceptron, we have to distinguish two cases:

First case: $\boldsymbol{\sigma} = (\sigma_0, \sigma, \ldots, \sigma)$, i.e. the output values for all patterns except $\boldsymbol{\xi}^{(0)}$ are the same. This output can be realized by the weights



$$\mathbf{J} = (-\sigma, \sigma_0, \ldots) \qquad (5)$$



Second case: For all output vectors different from 
the first case, we can assert 



$$\Bigl|\sum_{i=1}^{(N+1)/2} \sigma_i\Bigr| \le \frac{N-3}{2} \qquad (6)$$



since at least one $\sigma_i$ in the sum differs from the rest. As weights we choose



,/ = (^cr, ki, . . . , fc w-3 , a N+i , 

2 2 

where $\mathbf{k}$ can be any $\pm 1$-vector with



(7) 



According to eq. (6), such a vector can always be found. Again we have $\operatorname{sign}(\mathbf{J} \cdot \boldsymbol{\xi}^{(\mu)}) = \sigma_\mu$ for $\mu = 0, \ldots, (N+1)/2$.

This proves that the set (4) is shattered and hence

$$d_{VC} \ge \frac{N+3}{2} \qquad (8)$$



for the Ising-perceptron with binary patterns. This value of $d_{VC}$ agrees very well with numerical results obtained by a statistical enumeration method [10, 8]. For this method, one randomly draws $p$ binary patterns and calculates $\Delta(\boldsymbol{\xi}^{(1)}, \ldots, \boldsymbol{\xi}^{(p)})$ by enumeration of all perceptrons $\mathbf{J} \in \{\pm 1\}^N$. If a single pattern set with $\Delta(\ldots) = 2^p$ is found, we know that $d_{VC} \ge p$. Like the replica method, this method is not suited to calculate the VC dimension in cases where the maximum shattered sets are rare.
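The statistical enumeration just described might be sketched as follows. This is a minimal illustration and not the implementation used in the cited works; the function names `count_dichotomies` and `random_lower_bound` and the trial count are assumptions made for the example, and $N$ is taken to be odd so that $\mathbf{J} \cdot \boldsymbol{\xi}$ never vanishes.

```python
import random
from itertools import product


def count_dichotomies(patterns, n):
    """Number of distinct output vectors over all J in {-1,+1}^n (n odd)."""
    seen = set()
    for J in product((-1, 1), repeat=n):
        outputs = tuple(
            1 if sum(j * x for j, x in zip(J, xi)) > 0 else -1
            for xi in patterns
        )
        seen.add(outputs)
    return len(seen)


def random_lower_bound(n, p, trials, rng):
    """True if one of `trials` random sets of p distinct patterns is shattered.

    A True result proves d_VC >= p; a False result proves nothing,
    which is exactly the weakness mentioned in the text.
    """
    all_patterns = list(product((-1, 1), repeat=n))
    for _ in range(trials):
        sample = rng.sample(all_patterns, p)
        if count_dichotomies(sample, n) == 2 ** p:
            return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility
    for p in range(1, 6):
        print(p, random_lower_bound(5, p, trials=200, rng=rng))
```

Because a failed search is inconclusive, the sketch can only ever raise the lower bound, never certify the exact value of $d_{VC}$.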

There is however a method that guarantees the exact evaluation of the VC dimension: exhaustive enumeration of all shattered sets. The shattered pattern sets can be arranged as the nodes of a tree. The root of the tree is the empty pattern set (conveniently defined to be shattered). The children of a $P$-pattern node are formed by all those shattered $(P+1)$-pattern sets that can be obtained from the parent by adding a new pattern. The recursive application of this definition gives the complete tree of all shattered sets. The VC dimension is the height of the tree. It can be measured by a traversal of the complete tree using standard algorithms.

The branching factor of the tree is $O(2^N)$, its height is $O(N)$, giving an overall complexity of $O(2^{N^2})$. This exponential complexity limits the reachable size $N$ very soon and calls for some tricks to reduce the number of nodes.

Before we can think of reducing the number of nodes, we must ensure that every node, i.e. every shattered set, is considered only once. A $\pm 1$-pattern can be read as an $N$-bit integer (identifying $-1$ with 0), hence we have an order relation among the patterns. If we only add patterns that are larger than all current elements of the set, uniqueness of the nodes is guaranteed.
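A plain depth-first traversal of the tree of shattered sets, using the integer ordering to visit every set exactly once, could look like the sketch below. It is an unoptimized illustration under the assumption that $N$ is odd; the function name `vc_dimension` and the internal data layout are choices made here, not taken from the paper, and no pruning is applied.

```python
from itertools import product


def vc_dimension(n):
    """Exact VC dimension of the Ising perceptron with binary patterns (n odd).

    Patterns are indexed by their position in the lexicographic list of
    {-1,+1}^n, which is the n-bit integer obtained by reading -1 as 0.
    """
    vecs = list(product((-1, 1), repeat=n))
    # out[a][j] = 1 if weight vector j maps pattern a to +1, else 0
    out = [
        [1 if sum(j * x for j, x in zip(J, xi)) > 0 else 0 for J in vecs]
        for xi in vecs
    ]
    best = 0

    def visit(codes, depth, last):
        nonlocal best
        best = max(best, depth)
        # Only append patterns with a larger index than all current elements
        for a in range(last + 1, len(vecs)):
            new_codes = [2 * c + o for c, o in zip(codes, out[a])]
            if len(set(new_codes)) == 2 ** (depth + 1):  # still shattered
                visit(new_codes, depth + 1, a)

    # Root: the empty pattern set, defined to be shattered
    visit([0] * len(vecs), 0, -1)
    return best


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, vc_dimension(n))
```

Each node stores, for every weight vector, the classification of the current pattern set as an integer, so the shattering test reduces to counting distinct integers.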

The first trick to reduce the number of nodes exploits the symmetry of the problem: A shattered set remains shattered if we multiply one of its elements or the $i$-th entry of all elements by $-1$. Therefore we may restrict ourselves to pattern sets where all elements start with $-1$: $\boldsymbol{\xi} = (-1, \ldots)$, and we can fix the set containing only the pattern $(-1, -1, \ldots, -1)$ as the root of the tree.

The second trick is of the branch and bound variety and exploits the fact that we are not interested in the complete tree but only in its height. Let us assume that we have an easy to calculate upper bound for the maximum height that can be reached from a given node. If this upper bound turns out to be lower than the maximum height already found during our traversal, we can safely skip the exploration of the subtree rooted in this node!

The binary outputs of a set of $P$ patterns can be interpreted as a $P$-bit number $c$. Iterating over all $2^N$ binary weight vectors of our network, we get $2^N$ such output numbers $c$. If $P < N$, some of the $c$ values must appear more than once. Let $f_c$ denote the frequency of the output value $c$. The number of different classifications of this pattern set is given by the number of $f_c > 0$:



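$$\Delta(\boldsymbol{\xi}^{(1)}, \ldots, \boldsymbol{\xi}^{(P)}) = \bigl|\{c : f_c > 0\}\bigr| \qquad (9)$$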



The $f_c$'s have to be calculated at each node to test whether the pattern set is shattered or not. If

$$f_{\min} = \min_c \{f_c\} \qquad (10)$$

is 0, the pattern set is not shattered (at least one classification $c$ has not been realized). If $f_{\min} > 0$ the pattern set is shattered and we can try to enlarge it. Each new pattern can split an existing classification into two (appending a $-1$ to $c$ for some weight vectors and a $+1$ for others), i.e. from each classification $c$ we get 2 new classifications $c_1$ and $c_2$ with $f_c = f_{c_1} + f_{c_2}$. One of the new frequencies is always $\le f_c/2$. Therefore we have $\log_2 f_{\min}$ as an upper bound for the number of patterns that can be added to a shattered set before we definitely get a non shattered set.
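The frequency count and the resulting bound can be illustrated with a short sketch. The helper names `class_frequencies` and `extension_bound`, and the convention of returning `-1` for a non shattered set, are assumptions introduced for this example; $N$ is again taken to be odd.

```python
from collections import Counter
from itertools import product


def class_frequencies(patterns, n):
    """f_c for every classification c realised by some J in {-1,+1}^n."""
    freq = Counter()
    for J in product((-1, 1), repeat=n):
        c = tuple(
            1 if sum(j * x for j, x in zip(J, xi)) > 0 else -1
            for xi in patterns
        )
        freq[c] += 1
    return freq


def extension_bound(patterns, n):
    """floor(log2 f_min): how many patterns can at most still be added.

    Returns -1 if the set is not shattered, i.e. some f_c is 0.
    """
    freq = class_frequencies(patterns, n)
    if len(freq) < 2 ** len(patterns):
        return -1
    f_min = min(freq.values())
    return f_min.bit_length() - 1  # integer floor(log2(f_min))


if __name__ == "__main__":
    # N = 3, one pattern: both classes occur 4 times, so at most 2 more fit
    print(extension_bound([(1, 1, 1)], 3))
```

In a branch and bound traversal, a node at depth $P$ would be skipped whenever $P$ plus this bound does not exceed the best height found so far.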

This strategy allows us to prune many subtrees. For $N = 5$, branch and bound reduces the number of nodes from 77 to 4, for $N = 7$ from 8389 to 4625.

Even with these tricks, the complexity $O(2^{N^2})$ is overwhelming. On an UltraSparc I 170, the exhaustive enumeration for $N = 7$ takes less than a second. For $N = 9$, the running time is 6.5 hours! Nevertheless, the results obtained for $N \le 9$ are already quite remarkable. For $N = 7$, the set



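$$
\begin{aligned}
\boldsymbol{\xi}^{(1)} &= (-1,-1,-1,+1,+1,+1,+1) \\
\boldsymbol{\xi}^{(2)} &= (-1,+1,+1,-1,-1,+1,+1) \\
\boldsymbol{\xi}^{(3)} &= (-1,+1,+1,+1,+1,-1,-1) \\
\boldsymbol{\xi}^{(4)} &= (+1,-1,+1,-1,+1,-1,+1) \\
\boldsymbol{\xi}^{(5)} &= (+1,-1,+1,+1,-1,+1,-1) \\
\boldsymbol{\xi}^{(6)} &= (+1,+1,-1,-1,+1,+1,-1) \\
\boldsymbol{\xi}^{(7)} &= (+1,+1,-1,+1,-1,-1,+1)
\end{aligned}
\qquad (11)
$$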



is shattered, hence $d_{VC} = 7$, the maximum possible value! Together with $d_{VC} = 4$ for $N = 5$ and $d_{VC} = 7$ for $N = 9$, these results do not allow a decent conjecture for the general expression. However, partial enumerations for larger values of $N$ indicate that $d_{VC}$ is substantially larger than the value $\frac{N+3}{2}$ provided by (8).
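That the set (11) is shattered can be checked directly by enumerating all $2^7 = 128$ weight vectors and counting the distinct output vectors. The sketch below does exactly that; the names `SET_11` and `count_dichotomies` are labels chosen here for illustration.

```python
from itertools import product

# Pattern set (11) for N = 7, copied row by row from the text
SET_11 = [
    (-1, -1, -1, +1, +1, +1, +1),
    (-1, +1, +1, -1, -1, +1, +1),
    (-1, +1, +1, +1, +1, -1, -1),
    (+1, -1, +1, -1, +1, -1, +1),
    (+1, -1, +1, +1, -1, +1, -1),
    (+1, +1, -1, -1, +1, +1, -1),
    (+1, +1, -1, +1, -1, -1, +1),
]


def count_dichotomies(patterns):
    """Number of distinct output vectors over all J in {-1,+1}^N (N odd)."""
    n = len(patterns[0])
    seen = set()
    for J in product((-1, 1), repeat=n):
        seen.add(tuple(
            1 if sum(j * x for j, x in zip(J, xi)) > 0 else -1
            for xi in patterns
        ))
    return len(seen)


if __name__ == "__main__":
    # The set is shattered if all 2^7 classifications are realised
    print(count_dichotomies(SET_11), 2 ** len(SET_11))
```

Since there are exactly as many weight vectors as classifications, shattering here means that every weight vector produces a different output vector.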

The largest shattered sets found by exhaustive and partial enumerations share a common feature: They can be transformed into quasi-orthogonal sets, i.e. into sets where the patterns have minimum pairwise overlap$^1$,

$$\sum_{i=1}^{N} \xi_i^{(\mu)} \xi_i^{(\nu)} = \pm 1 \qquad (\mu \neq \nu) \qquad (12)$$



This observation leads to the idea of restricting the 
enumeration to quasi orthogonal pattern sets. 

To find such pattern sets, the notion of Hadamard-matrices is useful (see e.g. [11] or any textbook on combinatorics or coding theory). A Hadamard matrix is an $m \times m$-matrix $H$ with $\pm 1$-entries such that

$$H H^T = m I \qquad (13)$$

where $I$ is the $m \times m$ identity matrix. The rows (or columns) of a Hadamard matrix form a set of $m$ orthogonal binary patterns! This implies that $m$ must be even, but the whole truth is more restrictive: If $H$ is an $m \times m$ Hadamard matrix, then $m = 1$, $m = 2$ or $m \equiv 0 \bmod 4$. The reversal is a famous open question: Is there a Hadamard matrix of order $m = 4n$ for every positive $n$? The first open case is $m = 428$.

For special values of $m$ there are rules to construct Hadamard matrices, e.g.:

• $m = 2^n$ (Sylvester type)

• $m = q + 1$ where $q$ is a prime power and $q \equiv 3 \bmod 4$ (Paley type).

• $m = 2(q + 1)$ where $q$ is a prime power and $q \equiv 3 \bmod 4$ (Paley type).

These rules provide us with Hadamard matrices of sufficient size$^2$. To get from a $4n \times 4n$ Hadamard matrix to quasi orthogonal binary patterns we either cut out one column ($N = 4n - 1$) or add an arbitrary column ($N = 4n + 1$) and take the rows of the resulting matrix as patterns. The pattern set (11) is a result of this procedure applied to the $8 \times 8$ Hadamard matrix $H_8$ of Sylvester type:

$$H_8 = H_2 \otimes H_2 \otimes H_2 \qquad (14)$$

with

$$H_2 = \begin{pmatrix} -1 & -1 \\ -1 & +1 \end{pmatrix} \qquad (15)$$

$\otimes$ denotes the usual Kronecker product.
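The Sylvester construction and the step from $H_8$ to quasi orthogonal patterns can be sketched in plain Python. The helper names `kron`, `sylvester` and `is_hadamard` are assumptions of this example. Dropping the first row (the all $-1$ row) in addition to the first column is also an assumption made here in order to arrive at seven patterns; the text itself only mentions cutting out a column.

```python
from itertools import combinations

H2 = [[-1, -1], [-1, +1]]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def sylvester(k):
    """Hadamard matrix of order 2^k as a k-fold Kronecker power of H2."""
    h = [[1]]
    for _ in range(k):
        h = kron(h, H2)
    return h


def is_hadamard(h):
    """Check H H^T = m I for a square +-1 matrix."""
    m = len(h)
    return all(
        sum(x * y for x, y in zip(h[i], h[j])) == (m if i == j else 0)
        for i in range(m) for j in range(m)
    )


if __name__ == "__main__":
    h8 = sylvester(3)
    # Cut out the first column; additionally skip the first row (assumption)
    patterns = [tuple(row[1:]) for row in h8[1:]]
    for xi in patterns:
        print(xi)
    # Pairwise overlaps of the resulting N = 7 patterns
    print({sum(x * y for x, y in zip(a, b))
           for a, b in combinations(patterns, 2)})
```

Removing one column from mutually orthogonal rows changes every pairwise overlap from 0 to $\pm 1$, which is the quasi-orthogonality of eq. (12).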

The restriction to quasi orthogonal pattern sets allows us to consider larger values of $N$, but now the enumeration gives only lower bounds for $d_{VC}$. Results for $N \le 31$ are displayed in figure 1. The lower bound $d_{VC} = \frac{N+3}{2}$ achieved by the set (4) is exceeded for all $N > 5$, but the theoretic upper bound $d_{VC} = N$ is attained only for $N = 7$. The data are not suited for a decent conjecture about a general expression for $d_{VC}(N)$. Even the mere existence of a well defined asymptotic behaviour for $N \to \infty$ looks questionable. The VC dimension seems to be sensitive not only to the size but also to the number-theoretic properties of $N$: We observe a jump in $d_{VC}(N)$ at $N = 2^n - 1$, i.e. at values of $N$ where the corresponding Hadamard matrix is of Sylvester type.

$^1$ Exact orthogonality cannot be achieved for $N$ odd.
$^2$ The first value of $m = 4n$ where none of them applies is $m = 92$.

Figure 1: VC-dimension of the Ising perceptron with binary patterns vs. $N$. $d_{VC} = N$ is an upper bound, $d_{VC} = \frac{N+3}{2}$ is a lower bound provided by the set (4).

The lower bounds in figure 1 do not rule out the possibility of a much more regular behaviour of the true $d_{VC}(N)$, including well defined asymptotics. However, if the limit $\lim_{N \to \infty} d_{VC}/N$ exists, it will probably be larger than 0.5.

Acknowledgements. The author appreciates 
fruitful discussions with A. Engel. Thanks to 
C. Bessenrodt for her reference to Hadamard- 
matrices. 



References

[1] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. Appl. 16, page 264, 1971.

[2] V. N. Vapnik. Estimation of dependences based on empirical data. Springer, Berlin, 1982.

[3] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Proceedings COLT'91, San Mateo, 1991. Morgan Kaufmann.

[4] J. M. R. Parrondo and C. van den Broeck. Vapnik-Chervonenkis bounds for generalization. J. Phys. A 26, page 2211, 1993.

[5] A. Engel. Uniform convergence bounds for learning from examples. Mod. Phys. Lett. B 8, page 1683, 1994.

[6] N. Sauer. On the density of families of sets. J. Combinatorial Theory A 13, page 145, 1972.

[7] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC 14, page 326, 1965.

[8] S. Mertens and A. Engel. On the VC-dimension of neural networks with binary weights. To be published, 1996.

[9] A. Engel and M. Weigt. Multifractal analysis of the coupling space of feed-forward neural networks. To appear in Phys. Rev. E, 1996.

[10] J. Stambke. 1992. Diploma-Thesis, University of Giessen.

[11] O. Krisement. A Hopfield model with Hadamard prototypes. Z. Phys. B 80, page 415, 1990.

[12] T. Beth, D. Jungnickel, and H. Lenz. Design Theory, chapter 1.9. Bibliographisches Institut, Mannheim, 1985.



